The columns in the BioLog dataset are as follows:

Column ID    Description
Sample.ID    The location the sample was taken from. There are 2 water samples and 2 soil samples.
Rep          The experimental replicate. There are 3 replicates for each combination of experimental variables.
Well         The well number on the BioLog plate.
Dilution     The dilution factor of the sample.
Substrate    The name of the carbon source in that well. “Water” is the negative control.
Hr_24        The light absorbance value after 24 hours of incubation.
Hr_48        The light absorbance value after 48 hours of incubation.
Hr_144       The light absorbance value after 144 hours of incubation.

Here is the start of the BioLog dataset as an example:

##     Sample.ID Rep Well Dilution                   Substrate Hr_24 Hr_48
## 1 Clear_Creek   1   A1    0.001                       Water 0.000 0.000
## 2 Clear_Creek   1   A2    0.001       β-Methyl-D- Glucoside 0.004 0.005
## 3 Clear_Creek   1   A3    0.001 D-Galactonic Acid γ-Lactone 0.008 0.007
## 4 Clear_Creek   1   A4    0.001                  L-Arginine 0.003 0.002
## 5 Clear_Creek   1   B1    0.001   Pyruvic Acid Methyl Ester 0.002 0.000
## 6 Clear_Creek   1   B2    0.001                    D-Xylose 0.011 0.008
##   Hr_144
## 1  0.000
## 2  0.004
## 3  0.001
## 4  0.000
## 5  0.007
## 6  0.021

The first question we want to answer is whether the samples are fundamentally different from each other.

One way to do this is by making a plot:
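The plotting code is not shown in this report. As a minimal sketch of one way to draw such a comparison, the example below assumes a box plot of the 144-hour absorbance for each sample. The small BioLog data frame it builds is a made-up stand-in so the example runs on its own; it is not the real data.

```r
# Made-up stand-in for BioLog (NOT the real data) so the sketch runs on its own
set.seed(1)
BioLog <- expand.grid(
  Sample.ID = c("Clear_Creek", "Waste_Water", "Soil_1", "Soil_2"),
  Rep       = 1:3,
  Substrate = c("Water", "L-Arginine", "D-Xylose")
)
BioLog$Hr_144 <- runif(nrow(BioLog), min = 0, max = 2)

# One box per sample: absorbance after 144 hours of incubation
bp <- boxplot(Hr_144 ~ Sample.ID, data = BioLog,
              xlab = "Sample ID", ylab = "Absorbance at 144 hours")
```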

Overall, they don’t look very different. Let’s try a facet wrap for each substrate.
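A facet wrap splits one plot into a panel per group. The sketch below assumes the ggplot2 package and a box plot per sample within each substrate panel; the plot type and the toy BioLog data frame are assumptions for illustration, not the code or data behind the original figure.

```r
library(ggplot2)  # assumed; facet_wrap() is a ggplot2 function

# Made-up stand-in for BioLog (NOT the real data) so the sketch runs on its own
set.seed(1)
BioLog <- expand.grid(
  Sample.ID = c("Clear_Creek", "Waste_Water", "Soil_1", "Soil_2"),
  Rep       = 1:3,
  Substrate = c("Water", "L-Arginine", "D-Xylose")
)
BioLog$Hr_144 <- runif(nrow(BioLog), min = 0, max = 2)

# Same comparison as before, but with one panel per substrate
p <- ggplot(BioLog, aes(x = Sample.ID, y = Hr_144)) +
  geom_boxplot() +
  facet_wrap(~ Substrate) +
  labs(x = "Sample ID", y = "Absorbance at 144 hours") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
print(p)
```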

For some substrates, the water and soil samples look very different from each other. We can also run t tests comparing each sample with the others. Let’s start by subsetting the data into one dataset per sample ID.

# One data frame per sample ID
Clear_creek <- BioLog[BioLog$Sample.ID == "Clear_Creek", ]
Soil1 <- BioLog[BioLog$Sample.ID == "Soil_1", ]
Soil2 <- BioLog[BioLog$Sample.ID == "Soil_2", ]
Waste_water <- BioLog[BioLog$Sample.ID == "Waste_Water", ]

Next, we will use t tests on the 144-hour absorbance to compare the two water samples with each other, and then the two soil samples with each other.

t.test(Clear_creek$Hr_144, Waste_water$Hr_144)
## 
##  Welch Two Sample t-test
## 
## data:  Clear_creek$Hr_144 and Waste_water$Hr_144
## t = -4.5818, df = 555.18, p-value = 5.696e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.3675931 -0.1469902
## sample estimates:
## mean of x mean of y 
## 0.3511493 0.6084410
t.test(Soil1$Hr_144, Soil2$Hr_144)
## 
##  Welch Two Sample t-test
## 
## data:  Soil1$Hr_144 and Soil2$Hr_144
## t = 0.88529, df = 573.97, p-value = 0.3764
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.07108979  0.18776340
## sample estimates:
## mean of x mean of y 
##  1.399306  1.340969

This shows that the two water samples differ significantly from each other (p = 5.7e-06), while the two soil samples do not (p = 0.38).

Next, let’s compare each water sample with the first soil sample.

t.test(Clear_creek$Hr_144, Soil1$Hr_144)
## 
##  Welch Two Sample t-test
## 
## data:  Clear_creek$Hr_144 and Soil1$Hr_144
## t = -17.867, df = 539.62, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.1633916 -0.9329209
## sample estimates:
## mean of x mean of y 
## 0.3511493 1.3993056
t.test(Waste_water$Hr_144, Soil1$Hr_144)
## 
##  Welch Two Sample t-test
## 
## data:  Waste_water$Hr_144 and Soil1$Hr_144
## t = -12.47, df = 571.07, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.9154274 -0.6663018
## sample estimates:
## mean of x mean of y 
##  0.608441  1.399306

This shows that both water samples differ significantly from the first soil sample (p < 2.2e-16 in both tests), with much lower mean absorbance (0.35 and 0.61 versus 1.40). Let’s try it with the other soil sample:

t.test(Clear_creek$Hr_144, Soil2$Hr_144)
## 
##  Welch Two Sample t-test
## 
## data:  Clear_creek$Hr_144 and Soil2$Hr_144
## t = -16.794, df = 537.82, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.1055958 -0.8740431
## sample estimates:
## mean of x mean of y 
## 0.3511493 1.3409688
t.test(Waste_water$Hr_144, Soil2$Hr_144)
## 
##  Welch Two Sample t-test
## 
## data:  Waste_water$Hr_144 and Soil2$Hr_144
## t = -11.504, df = 570.44, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.8575905 -0.6074650
## sample estimates:
## mean of x mean of y 
##  0.608441  1.340969

The results are similar to those for the first soil sample: both water samples differ significantly from Soil_2 (p < 2.2e-16 in both tests).
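The six pairwise comparisons above can also be run in a single call. The sketch below uses pairwise.t.test() from base R's stats package with pool.sd = FALSE, so each pair is compared with separate variances as in the Welch tests above. The Holm adjustment is an assumption added here (the tests above are unadjusted), and the toy BioLog data frame is made up so the example runs on its own.

```r
# Made-up stand-in for BioLog (NOT the real data) so the sketch runs on its own
set.seed(1)
BioLog <- expand.grid(
  Sample.ID = c("Clear_Creek", "Waste_Water", "Soil_1", "Soil_2"),
  Rep       = 1:3,
  Substrate = c("Water", "L-Arginine", "D-Xylose")
)
BioLog$Hr_144 <- runif(nrow(BioLog), min = 0, max = 2)

# All pairwise t tests between samples, without pooling the standard deviation.
# p.adjust.method = "holm" is an assumed choice for multiple-comparison correction.
res <- pairwise.t.test(BioLog$Hr_144, BioLog$Sample.ID,
                       p.adjust.method = "holm", pool.sd = FALSE)
res$p.value  # lower-triangular matrix of adjusted p-values
```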

Next Question: Are the soil samples significantly different from the water samples?

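A Tukey test (Tukey's honest significant difference) is likely the best way to compare these, because it compares every pair of samples while adjusting for multiple comparisons.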

For the first part, we will compare the samples to each other overall:

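# Fit a one-way ANOVA of 144-hour absorbance by sample
mod1 <- aov(Hr_144 ~ Sample.ID, data = BioLog)
summary(mod1)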
##               Df Sum Sq Mean Sq F value Pr(>F)    
## Sample.ID      3  238.3   79.44   147.2 <2e-16 ***
## Residuals   1148  619.6    0.54                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(mod1)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Hr_144 ~ Sample.ID, data = BioLog)
## 
## $Sample.ID
##                                diff         lwr         upr     p adj
## Soil_1-Clear_Creek       1.04815625  0.89065127  1.20566123 0.0000000
## Soil_2-Clear_Creek       0.98981944  0.83231447  1.14732442 0.0000000
## Waste_Water-Clear_Creek  0.25729167  0.09978669  0.41479664 0.0001665
## Soil_2-Soil_1           -0.05833681 -0.21584178  0.09916817 0.7761474
## Waste_Water-Soil_1      -0.79086458 -0.94836956 -0.63335961 0.0000000
## Waste_Water-Soil_2      -0.73252778 -0.89003276 -0.57502280 0.0000000

The summary(mod1) output tells us that at least one sample differs from the others (F = 147.2, p < 2e-16), and the TukeyHSD(mod1) output shows which pairs of samples differ. The two soil samples are the only pair that is not significantly different (adjusted p = 0.78). Every other pair is significantly different, including the two water samples (adjusted p = 0.00017).

This can also be illustrated with a graph:

plot(TukeyHSD(mod1))

Comparisons whose confidence interval includes 0 are not significantly different. Only one interval, Soil_2-Soil_1, includes 0, which matches the results from the Tukey test.

The same test can be run for each individual substrate. The output for 2-Hydroxy Benzoic Acid is shown below.

TukeyHSD(aov(X$`2-Hydroxy Benzoic Acid`$Hr_144~X$`2-Hydroxy Benzoic Acid`$Sample.ID))
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = X$`2-Hydroxy Benzoic Acid`$Hr_144 ~ X$`2-Hydroxy Benzoic Acid`$Sample.ID)
## 
## $`X$`2-Hydroxy Benzoic Acid`$Sample.ID`
##                                diff        lwr        upr     p adj
## Soil_1-Clear_Creek       1.41855556  0.7626908  2.0744203 0.0000094
## Soil_2-Clear_Creek       1.17211111  0.5162463  1.8279759 0.0001770
## Waste_Water-Clear_Creek -0.01755556 -0.6734203  0.6383092 0.9998601
## Soil_2-Soil_1           -0.24644444 -0.9023092  0.4094203 0.7401743
## Waste_Water-Soil_1      -1.43611111 -2.0919759 -0.7802463 0.0000076
## Waste_Water-Soil_2      -1.18966667 -1.8455315 -0.5338019 0.0001438

For this substrate, each soil sample differs significantly from each water sample, while the two soil samples (adjusted p = 0.74) and the two water samples (adjusted p = 0.9999) do not differ significantly from each other.
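The object X used in the per-substrate call is not defined in this section. One plausible reading, offered here only as an assumption, is that X is BioLog split into one data frame per substrate. The sketch below builds such a list with split() and runs the same Tukey test on every substrate; the toy BioLog data frame and the name tukey_by_substrate are hypothetical.

```r
# Made-up stand-in for BioLog (NOT the real data) so the sketch runs on its own
set.seed(1)
BioLog <- expand.grid(
  Sample.ID = c("Clear_Creek", "Waste_Water", "Soil_1", "Soil_2"),
  Rep       = 1:3,
  Substrate = c("Water", "L-Arginine", "D-Xylose")
)
BioLog$Hr_144 <- runif(nrow(BioLog), min = 0, max = 2)

# Assumption: X is a list with one data frame per substrate
X <- split(BioLog, BioLog$Substrate)

# Run the same Tukey test within each substrate
tukey_by_substrate <- lapply(X, function(d) {
  TukeyHSD(aov(Hr_144 ~ Sample.ID, data = d))
})

# Look up one substrate by name
tukey_by_substrate$`L-Arginine`
```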

The next question is which substrates are driving any differences between the samples.
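One way to approach this, sketched here as an assumption rather than the method used in the original analysis, is to fit a one-way ANOVA within each substrate and rank the substrates by p-value. In the toy data below, which is made up, the soil samples are given a higher absorbance on D-Xylose only, so that substrate should rank first. The Holm adjustment is also an assumed choice.

```r
# Made-up stand-in for BioLog (NOT the real data) so the sketch runs on its own
set.seed(1)
BioLog <- expand.grid(
  Sample.ID = c("Clear_Creek", "Waste_Water", "Soil_1", "Soil_2"),
  Rep       = 1:3,
  Substrate = c("Water", "L-Arginine", "D-Xylose")
)
BioLog$Hr_144 <- runif(nrow(BioLog), min = 0, max = 0.5)

# Build in one real difference: soil samples respond strongly to D-Xylose
is_hit <- BioLog$Sample.ID %in% c("Soil_1", "Soil_2") &
  BioLog$Substrate == "D-Xylose"
BioLog$Hr_144[is_hit] <- BioLog$Hr_144[is_hit] + 1

# ANOVA p-value for the Sample.ID effect within each substrate
p_values <- sapply(split(BioLog, BioLog$Substrate), function(d) {
  fit <- aov(Hr_144 ~ Sample.ID, data = d)
  summary(fit)[[1]][["Pr(>F)"]][1]
})

# Smallest adjusted p-values first: these substrates separate the samples most
sort(p.adjust(p_values, method = "holm"))
```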


We also want to know if the dilution factor changes the results. The first step is to graph the data by sample ID and see if there is a difference between the dilution factors for each sample.
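The graph is not reproduced here. A minimal sketch, assuming a box plot with one box per dilution factor within each sample, is shown below. The toy BioLog data frame is made up, and the dilution values other than 0.001 are assumptions chosen for illustration.

```r
# Made-up stand-in for BioLog (NOT the real data); dilution values are assumed
set.seed(1)
BioLog <- expand.grid(
  Sample.ID = c("Clear_Creek", "Waste_Water", "Soil_1", "Soil_2"),
  Rep       = 1:3,
  Dilution  = c(0.1, 0.01, 0.001),
  Substrate = c("Water", "L-Arginine", "D-Xylose")
)
BioLog$Hr_144 <- runif(nrow(BioLog), min = 0, max = 2)

# One box per dilution factor, grouped within each sample
par(mar = c(9, 4, 1, 1))  # extra bottom margin for the rotated labels
bp <- boxplot(Hr_144 ~ Dilution + Sample.ID, data = BioLog,
              las = 2, xlab = "", ylab = "Absorbance at 144 hours")
```

If the boxes within a sample sit at clearly different heights, the dilution factor is worth adding to the model as an extra term.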


Finally, we want to find out if the control samples show any sign of contamination. There are two ways to do this: look at the absorbances of the control (in this case, the water substrate), or check the absorbances of all the substrates for negative values.

For the first method, we will subset the BioLog data set into just the control values, then see if there are any non-zero values by using the unique() function.
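The first method can be sketched as follows. The toy BioLog data frame is made up for illustration (one control well is given a small non-zero reading at 144 hours), and the names controls and control_values are hypothetical.

```r
# Made-up stand-in for BioLog (NOT the real data) so the sketch runs on its own
BioLog <- data.frame(
  Sample.ID = rep(c("Clear_Creek", "Soil_1"), each = 3),
  Substrate = rep(c("Water", "L-Arginine", "D-Xylose"), times = 2),
  Hr_24     = c(0, 0.003, 0.011, 0,     0.020, 0.015),
  Hr_48     = c(0, 0.002, 0.008, 0,     0.150, 0.090),
  Hr_144    = c(0, 0.000, 0.021, 0.004, 1.200, 0.800)
)

# Keep only the negative-control wells
controls <- BioLog[BioLog$Substrate == "Water", ]

# Distinct control readings at each time point; anything other than 0 is suspect
control_values <- lapply(controls[c("Hr_24", "Hr_48", "Hr_144")], unique)
control_values

# Show the control rows that have a non-zero reading at 144 hours
controls[controls$Hr_144 != 0, ]
```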

For the second method, we will use logical comparisons and sum the answer to check if there are any absorbance values below zero. We will do this for all three absorbance times.
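The second method can be sketched as follows. A logical comparison such as Hr_24 < 0 gives TRUE or FALSE for every row, and sum() counts the TRUE values. The toy BioLog data frame is made up (it contains one negative reading at 144 hours), and the name negatives is hypothetical.

```r
# Made-up stand-in for BioLog (NOT the real data) so the sketch runs on its own
BioLog <- data.frame(
  Substrate = c("Water", "L-Arginine", "D-Xylose", "Water"),
  Hr_24     = c(0, 0.003, 0.011,  0),
  Hr_48     = c(0, 0.002, 0.008,  0),
  Hr_144    = c(0, 0.000, 0.021, -0.002)
)

# Count the readings below zero at each of the three time points
negatives <- c(
  Hr_24  = sum(BioLog$Hr_24  < 0, na.rm = TRUE),
  Hr_48  = sum(BioLog$Hr_48  < 0, na.rm = TRUE),
  Hr_144 = sum(BioLog$Hr_144 < 0, na.rm = TRUE)
)
negatives
```

A count of 0 for every time point means no absorbance value fell below zero.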